Tibetan-Chinese Cross Language Text Similarity Calculation Based on LDA Topic Model

Authors

  • Sun Yuan
  • Zhao Qian
  • Pablo Gamallo Otero
Abstract

Building a topic model is the basis and the most critical module of cross-language topic detection and tracking. Topic models can also be applied to cross-language text similarity calculation, where they improve the speed and efficiency of computation by reducing the dimensionality of the texts. In this paper, we use the LDA model in cross-language text similarity computation to obtain Tibetan-Chinese comparable corpora, in three steps: (1) extending a Tibetan-Chinese dictionary by extracting Tibetan-Chinese entities from Wikipedia; (2) using the topic model to map the texts to the feature space of topics; (3) calculating the similarity of two texts in different languages according to the characteristics of news text. The LDA-based method for text similarity calculation reduces the dimensionality of the text vector space and improves the understanding of the texts' semantics, which in turn improves the speed and efficiency of the calculation.
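As a rough illustration of steps (2) and (3), the sketch below compares texts after they have been mapped to a shared topic space. It assumes the document-topic distributions have already been inferred by an LDA model; the vectors, the number of topics, and the choice of cosine similarity are all hypothetical, since the abstract does not state which similarity measure the authors use.

```python
import math


def cosine_similarity(p, q):
    """Cosine similarity between two equal-length topic vectors."""
    if len(p) != len(q):
        raise ValueError("topic vectors must have the same length")
    dot = sum(a * b for a, b in zip(p, q))
    norm_p = math.sqrt(sum(a * a for a in p))
    norm_q = math.sqrt(sum(b * b for b in q))
    if norm_p == 0.0 or norm_q == 0.0:
        return 0.0  # an all-zero vector carries no topic information
    return dot / (norm_p * norm_q)


# Hypothetical document-topic distributions over K = 4 shared topics.
# In practice these would come from LDA inference on each text
# (assumption: both languages are mapped into the same topic space).
tibetan_doc = [0.70, 0.20, 0.05, 0.05]
chinese_doc_a = [0.65, 0.25, 0.05, 0.05]  # similar topic mix
chinese_doc_b = [0.05, 0.05, 0.20, 0.70]  # different topic mix

sim_a = cosine_similarity(tibetan_doc, chinese_doc_a)
sim_b = cosine_similarity(tibetan_doc, chinese_doc_b)

# The candidate with the higher score is the more likely comparable text.
best = "chinese_doc_a" if sim_a > sim_b else "chinese_doc_b"
print(best)
```

Because each text is represented by only K topic weights instead of one weight per vocabulary term, each comparison costs O(K) operations, which is the dimensionality reduction the abstract refers to.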

Similar resources

Automatic keyword extraction using Latent Dirichlet Allocation topic modeling: Similarity with golden standard and users' evaluation

Purpose: This study investigates the automatic keyword extraction from the table of contents of Persian e-books in the field of science using LDA topic modeling, evaluating their similarity with golden standard, and users' viewpoints of the model keywords. Methodology: This is a mixed text-mining research in which LDA topic modeling is used to extract keywords from the table of contents of sci...

Semantic Similarity Calculation of Chinese Word

This paper puts forward a two-layer method for calculating the semantic similarity of Chinese words. First, a Latent Dirichlet Allocation (LDA) topic model is used to generate the topic space. Words are then mapped into the topic space to form topic distributions, which are used to calculate the semantic similarity of words (the first layer of computation). Finally, using the semantic dictionary "HowNet" to...

Study of Chinese Text Similarity Based on Difference Factor in Word-Number

Text similarity calculation is basic work in Chinese information processing applications. A high-quality text similarity calculation method must be accurate and efficient; that is, it should be able to compare texts at the level of their natural-language meaning and arrive at similarity judgments close to those of a human reader, based on a full understanding of the author or text sou...

Language model adaptation using latent dirichlet allocation and an efficient topic inference algorithm

We present an effort to perform topic mixture-based language model adaptation using latent Dirichlet allocation (LDA). We use probabilistic latent semantic analysis (PLSA) to automatically cluster a heterogeneous training corpus, and train an LDA model using the resultant topic-document assignments. Using this LDA model, we then construct topic-specific corpora at the utterance level for interpol...

Hot Topic Extraction and Public Opinion Classification of Tibetan Texts

The increasing amount of Tibetan information has made Tibetan text processing popular and highly significant. In this study, Tibetan hot topic extraction and public opinion classification were investigated to accelerate the development of Tibetan information processing. First, Tibetan word segmentation in Tibetan hot topic extraction was presented. Second, feature selection based on term freque...


Publication date: 2015